In [ ]:

    
%%HTML
<style>
.container { width:100% } 
</style>

Simple Linear Regression with TensorFlow

We need to read our data from a csv file. The module csv offers a number of functions for reading and writing a csv file.



In [ ]:

    
import csv

The data we want to read is contained in the csv file 'cars.csv'. In this file, the first column has the miles per gallon, while the engine displacement in cubic inches is given in the third column. We convert miles per gallon into kilometres per litre (1 mile = 1.60934 kilometres, 1 gallon = 3.78541 litres) and cubic inches into litres (1 cubic inch = 0.0163871 litres).



In [ ]:

    
with open('cars.csv') as cars_file:
    reader       = csv.reader(cars_file, delimiter=',')
    line_count   = 0
    kpl          = []
    displacement = []
    for row in reader:
        if line_count != 0:  # skip header of file
            # miles per gallon is in first column 
            kpl         .append(float(row[0]) * 1.60934 / 3.78541) 
            # engine displacement is in third column
            displacement.append(float(row[2]) * 0.0163871)  
        line_count += 1
print(f'{line_count} lines read')
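
The two unit conversions used in the loop above can be checked in isolation. The following is a minimal sketch: the helper names mpg_to_kpl and cubic_inch_to_litre are hypothetical, and the sample values of 20 miles per gallon and 200 cubic inches are made up for illustration rather than taken from 'cars.csv'.

```python
KM_PER_MILE           = 1.60934
LITRES_PER_GALLON     = 3.78541
LITRES_PER_CUBIC_INCH = 0.0163871

def mpg_to_kpl(mpg):
    """Convert miles per gallon into kilometres per litre."""
    return mpg * KM_PER_MILE / LITRES_PER_GALLON

def cubic_inch_to_litre(cubic_inches):
    """Convert an engine displacement from cubic inches into litres."""
    return cubic_inches * LITRES_PER_CUBIC_INCH

print(mpg_to_kpl(20.0))            # roughly 8.5 kilometres per litre
print(cubic_inch_to_litre(200.0))  # roughly 3.28 litres
```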

Now kpl is a list of floating point numbers specifying the fuel efficiency in kilometres per litre, while the list displacement contains the corresponding engine displacements measured in litres.



In [ ]:

    
kpl[:5]

The fuel consumption is the reciprocal of the fuel efficiency. The variable lph gives the number of litres needed to drive 100 kilometres, i.e. 100 divided by kpl.



In [ ]:

    
lph = [ 100 / x for x in kpl]



In [ ]:

    
lph[:5]
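
Both steps can be combined: since the consumption is 100 / (mpg · 1.60934 / 3.78541), it can also be computed directly from miles per gallon as 100 · 3.78541 / (1.60934 · mpg). A minimal sketch, where the function name mpg_to_lph is hypothetical and 20 miles per gallon is a made-up sample value:

```python
def mpg_to_lph(mpg):
    """Litres per 100 km for a car that drives `mpg` miles per gallon."""
    return 100 * 3.78541 / (1.60934 * mpg)

print(mpg_to_lph(20.0))  # roughly 11.76 litres per 100 km
```

Note that the relation is inverse: doubling the miles per gallon halves the litres per 100 kilometres.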

Yes, these old American cars had a terrible fuel efficiency. But a look at the engine displacements gives us a clue about what is going on.



In [ ]:

    
displacement[:5]

The number of data pairs of the form $\langle x, y \rangle$ that we have read is stored in the variable m.



In [ ]:

    
m = len(displacement)
m

In order to plot the fuel consumption versus the engine displacement, we turn the lists displacement and lph into numpy arrays.



In [ ]:

    
import numpy             as np
import matplotlib.pyplot as plt
import seaborn           as sns



In [ ]:

    
X = np.array(displacement)
Y = np.array(lph)



In [ ]:

    
plt.figure(figsize=(12, 12))
sns.set(style='whitegrid')
plt.scatter(X, Y, c='b')
plt.xlabel('engine displacement in litres')
plt.ylabel('fuel consumption in litres per 100 km')
plt.title('Fuel Consumption Versus Engine Displacement')

Next, we want to show how linear regression can be formulated as a minimization problem and how this minimization problem can be solved using TensorFlow.



In [ ]:

    
import tensorflow as tf

This example differs from our first example as this time the function that we want to minimize depends on a set of training data. Therefore, we have to define placeholders to insert our data into TensorFlow. We define a placeholder for the independent variable displacement and a placeholder for the dependent variable lph.

As we do not want to hardwire the number of examples, we set the shape of these placeholders to (None,), which denotes a one-dimensional array of unspecified length.



In [ ]:

    
X_ph = tf.placeholder(tf.float32, shape=(None,))
Y_ph = tf.placeholder(tf.float32, shape=(None,))

We have a linear model to predict the fuel consumption from the displacement. This linear model is as follows: $$ Y = \vartheta \cdot X $$ Here $X$ is the engine displacement, while $Y$ is the fuel consumption. Note that this linear model does not include a bias. The reason is that this bias should be $0$ as a car without an engine won't use any fuel.

A first guess for $\vartheta$ would be the average fuel consumption divided by the average engine displacement:



In [ ]:

    
theta_initial = np.mean(Y) / np.mean(X)
theta_initial

$\vartheta$ is the variable that we want to find. Hence we declare it as a TensorFlow Variable.



In [ ]:

    
ϑ = tf.Variable(theta_initial, dtype=tf.float32)

The loss function is defined as the sum of the squares of the errors. In order to normalize the loss, we divide it by the number of training examples $m$. $$ \texttt{loss} := \frac{1}{m} \cdot \sum\limits_{i=1}^m \bigl(\vartheta \cdot x_i - y_i\bigr)^2 $$ Here $x_i$ is the engine displacement of the $i$-th training example, while $y_i$ is the fuel consumption of this training example. Our goal is to determine the value of $\vartheta$ that minimizes this loss function.

The function square takes an array and squares it elementwise. The function reduce_sum computes the sum of all elements of an array.



In [ ]:

    
loss = tf.reduce_sum(tf.square(ϑ * X_ph - Y_ph)) / m
loss
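
To see what this expression computes, the same loss can be written in plain Python. This is only a sketch for illustration: the function name mean_squared_error and the three data points are hypothetical and not part of the data read from 'cars.csv'.

```python
def mean_squared_error(theta, xs, ys):
    """Return (1/m) * sum((theta * x_i - y_i)**2) over all data pairs."""
    m = len(xs)
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys)) / m

xs = [1.0, 2.0, 3.0]  # made-up engine displacements
ys = [2.0, 4.0, 7.0]  # made-up fuel consumptions
# For theta = 2 the errors are 0, 0 and -1, so the loss is 1/3.
print(mean_squared_error(2.0, xs, ys))
```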

We will use gradient descent to minimize our loss function. After some experimentation, I have chosen a learning rate $\alpha$ of $0.03$:



In [ ]:

    
α         = 0.03
train     = tf.train.GradientDescentOptimizer(α)
optimizer = train.minimize(loss)

Finally, we can start a TensorFlow session and run our optimizer for 9 steps of gradient descent. Observe how we have used the dictionary data_dict to feed the training data into our optimizer.



In [ ]:

    
init = tf.global_variables_initializer()
with tf.Session() as s:
    s.run(init)
    data_dict = {X_ph: X, Y_ph: Y}
    for k in range(9):
        s.run(optimizer, data_dict)            # one step of gradient descent
        theta, l = s.run([ϑ, loss], data_dict) # evaluate the variable ϑ and the loss function
        print('%2d: ϑ = %f, loss = %f' % (k, theta, l))
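
What the optimizer does in each step can be sketched without TensorFlow. For the loss defined above, the derivative with respect to $\vartheta$ is $\frac{2}{m} \cdot \sum_{i=1}^m (\vartheta \cdot x_i - y_i) \cdot x_i$, and one step of gradient descent replaces $\vartheta$ by $\vartheta$ minus $\alpha$ times this derivative. The function names and the toy data below are assumptions made for illustration; TensorFlow obtains the gradient by automatic differentiation rather than from a hand-written formula.

```python
def gradient(theta, xs, ys):
    """Derivative of the mean squared error with respect to theta."""
    m = len(xs)
    return 2 * sum((theta * x - y) * x for x, y in zip(xs, ys)) / m

def gradient_descent(xs, ys, alpha=0.03, steps=50):
    """Run a fixed number of gradient descent steps and return theta."""
    # First guess: mean of ys divided by mean of xs, as in the notebook.
    theta = (sum(ys) / len(ys)) / (sum(xs) / len(xs))
    for _ in range(steps):
        theta -= alpha * gradient(theta, xs, ys)
    return theta

xs = [1.0, 2.0, 3.0, 4.0]  # made-up data, not taken from cars.csv
ys = [2.0, 4.1, 5.9, 8.0]
theta = gradient_descent(xs, ys)
print(theta)
```

For this loss the update is stable only while $\alpha \cdot \frac{2}{m} \cdot \sum_{i=1}^m x_i^2 < 2$. A larger learning rate makes the updates overshoot and diverge.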

We can conclude: For a car from the seventies or early eighties that has an engine displacement of $d$ litres, the fuel consumption is about $3.18 \cdot d$ litres per 100 kilometres.

If we compare this notebook to the notebook Simple-Linear-Regression.ipynb that we developed at the beginning of this lecture, we notice the following:

In the notebook Simple-Linear-Regression.ipynb we had to derive a formula to compute the minimum of the loss function.
In the current notebook, we just had to specify that we want to use gradient descent to find the minimum. Everything else is dealt with by TensorFlow.
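
For the bias-free model used in this notebook, such a formula is short. The following derivation is a sketch for the loss defined above only; it sets the derivative of the loss with respect to $\vartheta$ to zero: $$ \frac{\mathrm{d}\,\texttt{loss}}{\mathrm{d}\vartheta} = \frac{2}{m} \cdot \sum\limits_{i=1}^m \bigl(\vartheta \cdot x_i - y_i\bigr) \cdot x_i = 0 \quad\Longleftrightarrow\quad \vartheta = \frac{\sum\limits_{i=1}^m x_i \cdot y_i}{\sum\limits_{i=1}^m x_i^2} $$ Since the second derivative $\frac{2}{m} \cdot \sum_{i=1}^m x_i^2$ is positive, this value of $\vartheta$ is indeed a minimum.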

Finally, we plot the results.



In [ ]:

    
xMax = max(X) + 0.2
plt.figure(figsize=(12, 10))
sns.set(style='darkgrid')
plt.scatter(X, Y, c='b')
plt.plot([0, xMax], [0, theta * xMax], c='r')
plt.xlabel('engine displacement in litres')
plt.ylabel('fuel consumption in litres per 100 km')
plt.title('Fuel Consumption versus Engine Displacement')



In [ ]: